add eval capes to sdk #460

Open
luke-e-schaefer wants to merge 11 commits into master from add-eval-capabilities

luke-e-schaefer commented May 12, 2026

resolves https://linear.app/scale-epd/issue/DE-7460

tests won't pass until https://github.com/scaleapi/scaleapi/pull/142963 is merged

Greptile Summary

This PR adds a full Evaluations V2 SDK surface to the Nucleus Python client — COCO-style detection metrics on model runs stored as evaluation_match_v2 rows. Three new NucleusClient methods (create_evaluation_v2, get_evaluation_v2, list_evaluations_v2) and an EvaluationV2 resource class cover the complete lifecycle.

  • EvaluationV2 (new dataclass): supports wait_for_completion(), charts() (mAP, confusion matrix, PR curve, TIDE), examples() (paginated TP/FP/FN rows), refresh(), and delete(). Status comparisons work correctly because EvaluationV2Status subclasses both str and Enum.
  • DTOs (EvaluationV2Charts, EvaluationV2ExamplesPage, EvaluationV2MatchExample, EvaluationV2FilterArgs): nullable fields that could be absent for FN/FP rows are correctly declared Optional with = None defaults; camelCase filter serialization is well-tested.
  • Tests: comprehensive unit coverage via mocked connections, including filter serialization, pagination, polling, delete, and error paths.
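
The status comparison noted in the first bullet rests on a plain Python property: an enum that also subclasses str compares equal to its raw string value. A minimal sketch, assuming hypothetical member names and terminal states (the real EvaluationV2Status values are not shown on this page):

```python
from enum import Enum


class EvaluationStatus(str, Enum):
    """Hypothetical stand-in for EvaluationV2Status; member names are assumed."""

    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed terminal states, for illustration only.
TERMINAL_STATES = (EvaluationStatus.COMPLETED, EvaluationStatus.FAILED)


def is_terminal(status: str) -> bool:
    """Accept either an enum member or the raw string from an API payload.

    Tuple membership falls back to ==, and str-mixin members compare
    equal to their string value, so both input forms behave the same.
    """
    return status in TERMINAL_STATES
```

With this shape, `is_terminal("failed")` and `is_terminal(EvaluationStatus.FAILED)` give the same answer, so callers do not need to convert payload strings before comparing.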

Confidence Score: 5/5

Safe to merge — new functionality only, no changes to existing paths, and nullable DTO fields are correctly handled.

The change is entirely additive: new files, new public exports, and three new NucleusClient methods that follow existing delegation patterns. The only finding is a wrong release-tag URL in CHANGELOG.md, which has no runtime impact. DTO nullable fields (iou, prediction_metadata, item_metadata) are correctly declared Optional, the str-enum status comparisons are sound, and the test suite covers the key code paths with mocked connections.
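
As a rough illustration of the two DTO behaviours mentioned here, the sketch below uses stdlib dataclasses rather than the PR's Pydantic models. Only iou, prediction_metadata, item_metadata, and the iouThreshold spelling come from this page; the class names, the remaining fields, and to_payload are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


def to_camel(name: str) -> str:
    """Convert snake_case to camelCase, e.g. iou_threshold -> iouThreshold."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


@dataclass
class MatchExample:
    """Hypothetical match row; nullable fields default to None."""

    match_type: str
    iou: Optional[float] = None  # may be absent for FN/FP rows
    prediction_metadata: Optional[Dict[str, Any]] = None
    item_metadata: Optional[Dict[str, Any]] = None


@dataclass
class FilterArgs:
    """Hypothetical filter object; field names are assumptions."""

    match_type: Optional[str] = None
    min_confidence: Optional[float] = None

    def to_payload(self) -> Dict[str, Any]:
        """Serialize set fields with camelCase keys, skipping None values."""
        payload: Dict[str, Any] = {}
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None:
                payload[to_camel(f.name)] = value
        return payload
```

Under these assumptions, `MatchExample(match_type="FN")` constructs without an iou, and `FilterArgs(match_type="FP").to_payload()` sends only the field that was set.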

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nucleus/evaluation_v2.py | New EvaluationV2 resource class with full lifecycle: create, poll, charts, examples, delete. Logic and comparisons are correct for str-enum status fields. |
| nucleus/data_transfer_object/evaluation_v2.py | New Pydantic DTOs for filters, charts, and match examples. Nullable fields are correctly declared Optional with defaults; camelCase filter serialization helper is well-tested. |
| nucleus/__init__.py | Adds create_evaluation_v2, get_evaluation_v2, and list_evaluations_v2 to NucleusClient; exports all new public types. Follows existing patterns for make_request/get/post delegation. |
| tests/test_evaluation_v2.py | Unit tests covering filters, pagination, wait-for-completion, delete, and error paths using mocked connections. Good coverage of the new SDK surface. |
| CHANGELOG.md | Adds 0.18.4 entry, but the hyperlink in the header incorrectly points to the v0.18.3 release tag instead of v0.18.4. |
| docs/index.rst | Adds Evaluations V2 section with a working code example; correct Sphinx cross-references to new methods. |

Sequence Diagram

sequenceDiagram
    participant User
    participant NucleusClient
    participant API

    User->>NucleusClient: create_evaluation_v2(model_run_id, ...)
    NucleusClient->>API: "POST modelRun/{id}/evaluationsV2"
    API-->>NucleusClient: "{evaluation_id}"
    NucleusClient->>API: "GET evaluationsV2/{evaluation_id}"
    API-->>NucleusClient: EvaluationV2 payload
    NucleusClient-->>User: EvaluationV2

    loop poll until terminal
        User->>NucleusClient: wait_for_completion()
        NucleusClient->>API: "GET evaluationsV2/{id}"
        API-->>NucleusClient: "{status}"
    end

    User->>NucleusClient: "charts(iou_threshold=0.5)"
    NucleusClient->>API: "GET evaluationsV2/{id}/charts?iouThreshold=0.5"
    API-->>NucleusClient: EvaluationV2Charts
    NucleusClient-->>User: EvaluationV2Charts

    User->>NucleusClient: "examples(match_type=FP, limit=20)"
    NucleusClient->>API: "POST evaluationsV2/{id}/examples"
    API-->>NucleusClient: EvaluationV2ExamplesPage
    NucleusClient-->>User: EvaluationV2ExamplesPage

    User->>NucleusClient: delete()
    NucleusClient->>API: "DELETE evaluationsV2/{id}"
    API-->>NucleusClient: 200/204
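
The "poll until terminal" loop in the diagram could look roughly like the following. This is a generic sketch, not the PR's wait_for_completion: fetch_status, the terminal status strings, and the default intervals are all assumptions.

```python
import time
from typing import Callable, Iterable


def wait_until_terminal(
    fetch_status: Callable[[], str],
    terminal: Iterable[str] = ("completed", "failed"),  # assumed status strings
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status until it returns a terminal status or time runs out.

    fetch_status is a hypothetical callable standing in for one GET of the
    evaluation resource. The sleep argument is injectable so tests can
    avoid real delays.
    """
    terminal_states = set(terminal)
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status still {status!r} after {timeout} seconds")
        sleep(poll_interval)
```

Checking the status before the deadline means an evaluation that finishes on the last poll is still reported as finished instead of timing out.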

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
CHANGELOG.md:8
The release-tag URL in the 0.18.4 header points to `v0.18.3` instead of `v0.18.4`, so the changelog link will resolve to the wrong release.

```suggestion
## [0.18.4](https://github.com/scaleapi/nucleus-python-client/releases/tag/v0.18.4) - 2026-05-28
```

Reviews (7): Last reviewed commit: "Merge branch 'add-eval-capabilities' of ..."

luke-e-schaefer self-assigned this May 12, 2026
Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated
Comment thread nucleus/__init__.py Outdated
luke-e-schaefer and others added 2 commits May 12, 2026 13:49
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated
luke-e-schaefer and others added 3 commits May 12, 2026 14:03
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread nucleus/data_transfer_object/evaluation_v2.py
edwinpav (Contributor) left a comment

Overall nice work!

Two main things:

  1. I'd make sure that the user-facing docs/descriptions are not overly complex. Not everyone will know or even care how the function works behind the scenes; they just care about the params, the returns, and the feature that the method provides.
  2. If you want to deploy a new sdk version with these changes, two more files need to be changed and added to this pr:
    1. CHANGELOG.md should be updated. The tag link that the CHANGELOG references will be created after this pr is merged into master. You'd add a new release with a new tag here: https://github.com/scaleapi/nucleus-python-client/releases. Feel free to ping for any questions! The process isn't super clear lol

    2. The sdk version under tool.poetry should be updated in pyproject.toml
      (see #457 as a reference pr)

Comment thread nucleus/__init__.py Outdated
Comment thread nucleus/__init__.py Outdated
Comment thread nucleus/evaluation_v2.py
self.__dict__.update(updated.__dict__)
return self

def wait_for_completion(
Contributor

Is this needed because this is not integrated with NucleusJobs? I thought this type of functionality comes built in for the other async functions (dedup async also uses temporal)

Author

correct yeah I don't have any ties back to the nuc jobs currently (since this stuff isn't "technically" in nucleus)...I could set that up tho that would be simple

Contributor

oh i see, ig if it's in the nucleus sdk might be worth doing that if it's simple. if it shows up on the nucleus jobs page ui that's probably fine but that's probably a call you have more context on to make

Author

yeah i think that's fine too. I'll run that in its own PR set tho after this one (i'll have to update scaleapi too)

Comment thread docs/index.rst Outdated
Comment thread nucleus/evaluation_v2.py Outdated
Comment thread nucleus/data_transfer_object/evaluation_v2.py
Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated
Comment thread nucleus/evaluation_v2.py Outdated
Comment thread tests/test_evaluation_v2.py
Comment thread nucleus/data_transfer_object/evaluation_v2.py
luke-e-schaefer requested a review from edwinpav May 28, 2026 22:55
2 participants